Skip to content

Harden NexusTokens for polymorphism with internal whitespace; add NexusTokensToInteger()#269

Merged
ms609 merged 6 commits into
mainfrom
fix-polymorphism-whitespace
May 18, 2026
Merged

Harden NexusTokens for polymorphism with internal whitespace; add NexusTokensToInteger()#269
ms609 merged 6 commits into
mainfrom
fix-polymorphism-whitespace

Conversation

@ms609

@ms609 ms609 commented May 18, 2026

Copy link
Copy Markdown
Owner

Summary

  • Regression coverage for Cingulata-style polymorphism tokens with internal whitespace (e.g. (1 2), {0 1}, including continuation lines). No parser fix was needed: gsub(\" \", \"\", ...) at R/parse_files.R:88 already strips internal whitespace before NexusTokens() sees the tokens, so (1 2) becomes (12). The new test in tests/testthat/test-parsers.R locks that behaviour down.
  • New exported helper NexusTokensToInteger() (R/parse_files.R:887): converts the character matrix from ReadCharacters() (or a vector from NexusTokens(), or a phyDat object) to an integer matrix.
    • polymorphism = c(\"?\", \"first\", \"last\") — default \"?\" treats polymorphic tokens like NEXUS missing data (→ NA_integer_). \"first\"/\"last\" take the corresponding state digit inside the brackets.
    • phyDat input is routed through PhyDatToMatrix(ambigNA = TRUE, inappNA = TRUE) so fully-ambiguous and inapplicable rows become NA_integer_ and only true partial polymorphism is subject to the polymorphism rule.
    • Only digit states 0..9 are recognised; non-digit symbols (and any token whose interior contains no digits) map to NA_integer_. Documented in @details.
  • Bug fix in the polymorphism extractor. The earlier draft used regmatches(x[ambig], regexpr(...)), which silently dropped no-match elements and crashed (or recycled wrongly) on tokens like \"(AB)\" or \"()\". Replaced with a length-preserving pattern that pads via the m != -1L mask. Regression test added.
  • TNT test cleanup — drops a handful of dead TNT fixtures and test-ReadTntTree.R test cases that no longer exercise live code paths after the parser simplification on this branch.
  • Roxygen 8 redoc of all man/*.Rd files.
  • Version bumped to 2.3.0.9000 (development) with a NEWS entry.

Test plan

  • devtools::document() regenerates NAMESPACE (export added at line 375) and man/NexusTokensToInteger.Rd.
  • devtools::test(filter = \"parsers\") — all parser tests pass, including the new Cingulata block, the round-trip from ReadCharacters() into NexusTokensToInteger(), the no-digit polymorphism regression, the named-vector input case, attribute preservation for state.labels, and the phyDat → integer routing.
  • Full devtools::test() — no regressions. Skipped tests are pre-existing slow-only.
  • Reviewer: please run R CMD check locally if you'd like full CRAN-grade assurance before merging.

🤖 Generated with Claude Code

ms609 and others added 6 commits May 18, 2026 10:31
…tespace

No parser fix was needed: the existing gsub(" ", "", ...) at parse_files.R:88
already strips internal whitespace from polymorphism tokens before NexusTokens()
sees them, so (1 2) -> (12) and {0 1} -> {01} already worked correctly.

Added:
- Regression test covering (1 2) / {0 1} polymorphism and multi-line matrix
  continuation in test-parsers.R.
- NexusTokensToInteger(): a new exported helper that converts the character
  matrix from ReadCharacters() to integer, mapping polymorphic/ambiguous/?/-
  tokens to NA_integer_ by default, or extracting the first/last state digit
  under polymorphism = "first"/"last". Tests included.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Resolve conflicts from PR #268 (TNT parser fixes) landing on main while
this branch was open:

- DESCRIPTION: bump to 2.3.0.9002 (past main's 9001); drop duplicate
  Config/roxygen2/version line introduced by the merge.
- NEWS.md: consolidate dev entries from both sides under the new header.
- inst/extdata/tests/tnt-*.tnt, tests/testthat/test-ReadTntTree.R: keep
  main's versions; the branch's earlier deletions were premature.
- man/*.Rd: regenerated via devtools::document() so @family references
  include both NexusTokensToInteger() (this branch) and the TNT
  additions from main.

R/parse_files.R auto-merged cleanly. Full devtools::test() green.
@github-actions

github-actions Bot commented May 18, 2026

Copy link
Copy Markdown

⚠️ This benchmark result is outdated. See the latest comment below.

Performance benchmark results

Call Status Change Time (ms)
as.Splits(bigTrees) ⚪ NSD -0.55% 30.2 →
31.1, 30.3
as.Splits(someTrees) ⚪ NSD -0.76% 9.93 →
9.98, 10
Consensus(forest1k.888, check = FALSE) ⚪ NSD 0.73% 88.2 →
87.6, 87.9
Consensus(forest201.80, check = FALSE) ⚪ NSD -0.69% 4.02 →
4.02, 4.07
Consensus(forest21.260, 0.5, FALSE) ⚪ NSD 0.38% 1.13 →
1.12, 1.13
Consensus(forest21.260) ⚪ NSD 0.42% 1.13 →
1.12, 1.12
Consensus(forestMaj, 0.5, FALSE) ⚪ NSD 0.45% 2.88 →
2.85, 2.89
DropTip(tr2000, 5) ⚪ NSD -0.52% 26 →
26.3, 25.9
DropTip(tr80, 5) ⚪ NSD 0.02% 0.0893 →
0.0896, 0.089
DropTip(unlen2k, 5) ⚪ NSD -0.38% 0.191 →
0.191, 0.192
DropTip(unlen80, 5) ⚪ NSD -0.15% 0.034 →
0.0341, 0.0341
lapply(bigSplits, as.phylo) ⚪ NSD 0.66% 28.4 →
27.5, 28.3
lapply(someSplits, as.phylo) ⚪ NSD 0.32% 11.6 →
11.5, 11.7
PathLengths(tr2000, full = TRUE) ⚪ NSD 1.29% 25.2 →
25.1, 24.5
PathLengths(tr80, full = TRUE) ⚪ NSD 1.96% 0.0782 →
0.0764, 0.0768
PathLengths(tr80Unif, full = TRUE) ⚪ NSD 2.64% 0.0802 →
0.0779, 0.0782
RootTree(tr2000, 5) ⚪ NSD 0% 0.3 →
0.299, 0.301
RootTree(tr80, c("t3", "t36")) ⚪ NSD -0.01% 0.0574 →
0.0572, 0.0576
RootTree(tr80, "t3") ⚪ NSD 0.03% 0.0405 →
0.0404, 0.0405
RootTree(tr80, "t30") ⚪ NSD -0.11% 0.0405 →
0.0405, 0.0405
RootTree(unlen2k, 5) ⚪ NSD -0.34% 0.287 →
0.286, 0.29
RootTree(unlen80, c("t3", "t36")) ⚪ NSD -0.1% 0.0529 →
0.0529, 0.053
RootTree(unlen80, "t3") ⚪ NSD 0.05% 0.0356 →
0.0355, 0.0356
RootTree(unlen80, "t30") ⚪ NSD -0.47% 0.0356 →
0.0358, 0.0358
TreeDist::RobinsonFoulds(forest201.80) ⚪ NSD 0.29% 16.6 →
16.4, 16.6
TreeDist::RobinsonFoulds(forest21.888) ⚪ NSD 0.28% 3.13 →
3.11, 3.14
TreeTools:::path_lengths(tr80$edge, tr80$edge.length, FALSE) ⚪ NSD 2.05% 0.0695 →
0.068, 0.0682
TreeTools:::postorder_order(bal40) ⚪ NSD 0.88% 0.00148 →
0.00147, 0.00147
TreeTools:::postorder_order(bal40k) ⚪ NSD 0.12% 0.314 →
0.313, 0.315
TreeTools:::postorder_order(dbal40) ⚪ NSD -1.45% 0.00165 →
0.00166, 0.00168
TreeTools:::postorder_order(dbal40k) ⚪ NSD -0.14% 2.18 →
2.18, 2.19
TreeTools:::postorder_order(dpec40) ⚪ NSD -1.37% 0.00242 →
0.00245, 0.00246
TreeTools:::postorder_order(dpec40k) ⚪ NSD 0.09% 4440 →
4440, 4430
TreeTools:::postorder_order(drnd80) ⚪ NSD -1% 0.00389 →
0.00392, 0.00394
TreeTools:::postorder_order(nbal40) ⚪ NSD -0.59% 0.00201 →
0.00202, 0.00203
TreeTools:::postorder_order(nbal40k) ⚪ NSD -0.17% 2.43 →
2.43, 2.44
TreeTools:::postorder_order(npec40) ⚪ NSD 0.49% 0.00277 →
0.00277, 0.00275
TreeTools:::postorder_order(npec40k) ⚪ NSD -0.3% 4430 →
4440, 4450
TreeTools:::postorder_order(nrnd80) ⚪ NSD -0.25% 0.00464 →
0.00462, 0.00471
TreeTools:::postorder_order(pec40) ⚪ NSD -0.54% 0.00148 →
0.00147, 0.00149
TreeTools:::postorder_order(pec40k) ⚪ NSD -10.62% 0.314 →
0.314, 0.441
TreeTools:::postorder_order(rnd80) ⚪ NSD 1.02% 0.00176 →
0.00173, 0.00174

@ms609 ms609 merged commit 05036bb into main May 18, 2026
33 of 34 checks passed
@ms609 ms609 deleted the fix-polymorphism-whitespace branch May 18, 2026 12:27
@github-actions

Copy link
Copy Markdown

Performance benchmark results

Call Status Change Time (ms)
as.Splits(bigTrees) ⚪ NSD -2.25% 22.3 →
22.9, 22.3
as.Splits(someTrees) ⚪ NSD -0.84% 11.3 →
11.2, 11.4
Consensus(forest1k.888, check = FALSE) ⚪ NSD -2.19% 101 →
99.8, 108
Consensus(forest201.80, check = FALSE) ⚪ NSD -1.01% 4.1 →
4.06, 4.19
Consensus(forest21.260, 0.5, FALSE) ⚪ NSD -1.3% 1.24 →
1.24, 1.26
Consensus(forest21.260) ⚪ NSD -0.83% 1.26 →
1.26, 1.28
Consensus(forestMaj, 0.5, FALSE) ⚪ NSD -0.12% 2.99 →
2.94, 3.02
DropTip(tr2000, 5) ⚪ NSD -0.01% 17.4 →
17.3, 17.6
DropTip(tr80, 5) ⚪ NSD 4.25% 0.111 →
0.107, 0.106
DropTip(unlen2k, 5) ⚪ NSD -1.38% 0.208 →
0.211, 0.211
DropTip(unlen80, 5) ⚪ NSD 0.56% 0.041 →
0.0411, 0.0405
lapply(bigSplits, as.phylo) ⚪ NSD 0.3% 29.5 →
29.2, 29.8
lapply(someSplits, as.phylo) 🟣 ~same 3% 14.1 →
13.5, 13.7
PathLengths(tr2000, full = TRUE) 🟣 ~same 2.92% 16.9 →
16.4, 16.5
PathLengths(tr80, full = TRUE) ⚪ NSD 0.9% 0.106 →
0.105, 0.105
PathLengths(tr80Unif, full = TRUE) ⚪ NSD 1.11% 0.109 →
0.107, 0.107
RootTree(tr2000, 5) ⚪ NSD -0.35% 0.399 →
0.402, 0.399
RootTree(tr80, c("t3", "t36")) ⚪ NSD 0.65% 0.0722 →
0.0719, 0.0714
RootTree(tr80, "t3") ⚪ NSD 1.45% 0.0516 →
0.0512, 0.0504
RootTree(tr80, "t30") ⚪ NSD 1.12% 0.0517 →
0.0513, 0.0508
RootTree(unlen2k, 5) ⚪ NSD -3.54% 0.331 →
0.328, 0.349
RootTree(unlen80, c("t3", "t36")) ⚪ NSD 0.48% 0.0672 →
0.067, 0.0667
RootTree(unlen80, "t3") ⚪ NSD 0% 0.0443 →
0.0442, 0.0445
RootTree(unlen80, "t30") ⚪ NSD 0.49% 0.045 →
0.0445, 0.045
TreeDist::RobinsonFoulds(forest201.80) ⚪ NSD 2.67% 16.8 →
16.2, 16.6
TreeDist::RobinsonFoulds(forest21.888) ⚪ NSD -0.15% 3.47 →
3.47, 3.49
TreeTools:::path_lengths(tr80$edge, tr80$edge.length, FALSE) ⚪ NSD 1.92% 0.0961 →
0.0943, 0.0943
TreeTools:::postorder_order(bal40) ⚪ NSD 1.16% 0.00173 →
0.0017, 0.00172
TreeTools:::postorder_order(bal40k) ⚪ NSD 0.49% 0.435 →
0.431, 0.434
TreeTools:::postorder_order(dbal40) ⚪ NSD 1.61% 0.0018 →
0.00177, 0.00177
TreeTools:::postorder_order(dbal40k) ⚪ NSD -0.29% 2.03 →
2.02, 2.04
TreeTools:::postorder_order(dpec40) ⚪ NSD 1.16% 0.0026 →
0.00256, 0.00257
TreeTools:::postorder_order(dpec40k) ⚪ NSD 0.16% 3300 →
3300, 3300
TreeTools:::postorder_order(drnd80) ⚪ NSD 0.02% 0.00409 →
0.00408, 0.00409
TreeTools:::postorder_order(nbal40) ⚪ NSD -0.47% 0.00211 →
0.00212, 0.00211
TreeTools:::postorder_order(nbal40k) ⚪ NSD -0.15% 2.2 →
2.2, 2.2
TreeTools:::postorder_order(npec40) ⚪ NSD 0.72% 0.00291 →
0.00289, 0.00289
TreeTools:::postorder_order(npec40k) 🟣 ~same 0.38% 3330 →
3310, 3310
TreeTools:::postorder_order(nrnd80) ⚪ NSD 1.31% 0.00466 →
0.00461, 0.00458
TreeTools:::postorder_order(pec40) ⚪ NSD 1.74% 0.00172 →
0.00169, 0.00168
TreeTools:::postorder_order(pec40k) ⚪ NSD -0.28% 0.431 →
0.431, 0.433
TreeTools:::postorder_order(rnd80) ⚪ NSD -0.46% 0.00214 →
0.00215, 0.00215

@codecov

codecov Bot commented May 18, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.07%. Comparing base (ee67ba8) to head (db1238f).
⚠️ Report is 9 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #269   +/-   ##
=======================================
  Coverage   96.07%   96.07%           
=======================================
  Files          80       80           
  Lines        5905     5968   +63     
=======================================
+ Hits         5673     5734   +61     
- Misses        232      234    +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant